Comment data was downloaded from the FCC and processed (code on github). We'll do a little exploratory data analysis on that data (over 5GB) and then output some of it to CSV files for use in creating graphics.

First, let's load the modules we're going to use and set some global variables:


In [7]:
#Modules
import pandas as pd # data frames
import sqlite3 as sql #sqlite for reading the database
import numpy as np # mathematics
from scipy import stats # statistical routines
#import pandasql as pdsql # library to interface pandas and sqlite3 - doesn't work though...

#Globals
FILENAME = '/run/media/potterzot/My Passport/potterzot/data/fcc/nn_comments.db'

The data is stored in a SQLite3 database that we'll use as the basis for the analysis. Initially I tried this using HDF5, but my laptop couldn't handle the memory requirements of so much text, and chunking with pandas wasn't very effective (or perhaps I should say, I wasn't very effective with it).
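
For reference, this is roughly what the chunked approach looks like: pandas' read_sql_query accepts a chunksize argument and yields DataFrames one piece at a time. The chunk size and the processing step here are just placeholders.

import pandas as pd
import sqlite3 as sql

db = sql.connect(FILENAME)
# Pull the comments in manageable pieces instead of loading all 5GB of text at once
for chunk in pd.read_sql_query("SELECT id, text FROM comments", db, chunksize=50000):
    # ... process each chunk here (count words, flag duplicates, etc.)
    pass
db.close()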


In [10]:
# Initialize the connection
db = sql.connect(FILENAME)
cursor = db.cursor()

# How many comments were there?
cursor.execute("SELECT COUNT(*) FROM comments")
cursor.fetchone()


Out[10]:
(446719,)

So we've got almost 450,000 comments. I read somewhere that nearly 150,000 were the same though, so let's see how many unique comments we have.


In [13]:
cursor.execute("SELECT COUNT(*) FROM (SELECT DISTINCT text FROM comments)")
cursor.fetchone()


Out[13]:
(445999,)

Hmm, that doesn't square with what the FCC said, so we should look at it more closely. Let's also get the important data into a dataframe, if possible without breaking the bank.
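
As a starting point for that closer look, a query along these lines (just a sketch, not run here) would surface the most frequently repeated comment texts:

# Top five most-repeated comment texts and how often each appears
q = """SELECT COUNT(*) AS n, substr(text, 1, 80) AS excerpt
       FROM comments
       GROUP BY text
       ORDER BY n DESC
       LIMIT 5"""
cursor.execute(q)
cursor.fetchall()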


In [23]:
# See what fields are available and save them as a data frame
cursor.execute("PRAGMA table_info(comments)")
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) for each column
metadata = pd.DataFrame(cursor.fetchall(), columns=('cid', 'name', 'type', 'notnull', 'dflt_value', 'pk'))
metadata


Out[23]:
cid name type notnull dflt_value pk
0 0 id TEXT 0 None 0
1 1 applicant TEXT 0 None 0
2 2 applicant_sort TEXT 0 None 0
3 3 author TEXT 0 None 0
4 4 author_sort TEXT 0 None 0
5 5 brief INT 0 None 0
6 6 stateCd TEXT 0 None 0
7 7 city TEXT 0 None 0
8 8 zip TEXT 0 None 0
9 9 daNumber TEXT 0 None 0
10 10 dateCommentPeriod TEXT 0 None 0
11 11 dateReplyComment TEXT 0 None 0
12 12 dateRcpt TEXT 0 None 0
13 13 disseminated TEXT 0 None 0
14 14 exParte INT 0 None 0
15 15 fileNumber TEXT 0 None 0
16 16 lawfirm TEXT 0 None 0
17 17 lawfirm_sort TEXT 0 None 0
18 18 modified TEXT 0 None 0
19 19 pages INT 0 None 0
20 20 proceeding TEXT 0 None 0
21 21 reportNumber TEXT 0 None 0
22 22 regFlexAnalysis INT 0 None 0
23 23 smallBusinessImpact INT 0 None 0
24 24 submissionType TEXT 0 None 0
25 25 text TEXT 0 None 0
26 26 viewingStatus TEXT 0 None 0
27 27 score REAL 0 None 0

Okay, let's create one data frame that holds the geographic and other important fields, but not the text itself, since the text is the major memory consumer.


In [26]:
# Create the query
q = "SELECT id, applicant, author, city, stateCD AS state, zip, pages, score, dateRcpt AS received FROM comments"
cursor.execute(q)
nontextdata = pd.DataFrame(cursor.fetchall(), columns = ('id', 'applicant', 'author', 'city', 'state', 'zip', 'pages', 'score', 'received'))
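
As an aside, pandas can do the execute/fetchall/DataFrame dance in one call, taking the column names straight from the query; this is roughly equivalent to the cell above:

# One-liner equivalent: pandas runs the query and names the columns itself
nontextdata = pd.read_sql_query(q, db)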

In [33]:
# See the number of unique names, cities, states, zips
for v in ['applicant', 'state', 'city', 'zip']:
    print(v+": "+str(len(pd.Series(nontextdata[v]).unique())))


applicant: 355404
state: 28873
city: 60
zip: 27364

So it looks like the city and state fields are swapped in the source data: the state column has 28,873 unique values (presumably cities) and the city column has only 60 (presumably states). A quick look also shows we've got a few extra 'states' thrown into the comments.


In [37]:
nontextdata['city'].unique()


Out[37]:
array(['TN', 'CA', 'SC', 'NY', 'VA', 'MI', 'OR', 'FL', 'PA', 'WI', 'MA',
       'NJ', 'WA', 'IN', 'OK', 'RI', 'NC', 'TX', 'CO', 'CT', 'AZ', 'DE',
       'GA', 'OH', 'MD', 'NM', 'ME', 'NH', 'MO', 'SD', 'KY', 'IL', 'IA',
       'AR', 'LA', 'VT', 'AL', 'KS', 'MN', 'NE', 'WV', 'NV', 'HI', 'AK',
       'ID', 'MS', 'UT', 'MT', 'DC', 'WY', 'ND', 'PR', 'VI', '', 'MP',
       'GU', 'FM', 'PW', 'MH', 'AS'], dtype=object)

Histograms can give a good visual estimate of the distribution among states, though Edward Tufte might argue in favor of a simple table instead.
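
A rough sketch of such a histogram, using matplotlib (not imported above) and remembering that the state codes live in the 'city' column of this extract:

import matplotlib.pyplot as plt

# Comments per state; the state abbreviations are in the 'city' column here
state_counts = nontextdata['city'].value_counts()
state_counts.plot(kind='bar', figsize=(14, 4), title='Comments by state')
plt.show()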


In [50]:
db.close()